GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 10 - Model Selection And Boosting/XGBoost/[Python] XGBoost.ipynb
Kernel: Python 3

XGBoost

Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head(1)
X = dataset.iloc[:, [3, 4, 6, 7, 8, 9, 10, 11, 12]].values
y = dataset.iloc[:, 13].values
X[0]
array([619, 'France', 42, 2, 0.0, 1, 1, 1, 101348.88], dtype=object)
y[0]
1
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])  # encode the Geography column as integers
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap
X[0:2]
array([[ 0.00000000e+00, 0.00000000e+00, 6.19000000e+02, 4.20000000e+01, 2.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.01348880e+05], [ 0.00000000e+00, 1.00000000e+00, 6.08000000e+02, 4.10000000e+01, 1.00000000e+00, 8.38078600e+04, 1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.12542580e+05]])
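Note: the `categorical_features` argument of `OneHotEncoder` has been removed in newer scikit-learn releases, so the encoding cell above only runs on the older version used for this notebook. On a recent scikit-learn, a rough equivalent (a sketch, not part of the original notebook) replaces the LabelEncoder/OneHotEncoder pair with a ColumnTransformer applied to the raw Geography column:

# Sketch for newer scikit-learn (>= 0.20): one-hot encode column 1 (Geography)
# with a ColumnTransformer instead of the removed `categorical_features` argument.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('geo', OneHotEncoder(), [1])], remainder='passthrough')
X = ct.fit_transform(X)
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap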
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Fitting XGBoost to the training set

from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1)
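The classifier above is trained with all default hyperparameters. With the XGBoost version used here you can also hold out part of the training data and stop boosting once the validation log loss stops improving, via the `eval_set`, `eval_metric` and `early_stopping_rounds` arguments of `fit` documented in the help output further below. A minimal sketch (the extra validation split is not part of the original notebook):

# Optional sketch: carve a validation set out of the training data and stop
# boosting early once validation log loss has not improved for 10 rounds.
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
classifier_es = XGBClassifier(n_estimators=500)
classifier_es.fit(X_tr, y_tr,
                  eval_set=[(X_val, y_val)],
                  eval_metric='logloss',
                  early_stopping_rounds=10,
                  verbose=False)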

Predicting the Test set results

y_pred = classifier.predict(X_test)
y_pred[0:10]
array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1])
y_test[0:10]
array([0, 1, 0, 0, 0, 1, 0, 0, 1, 1])

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[1532, 63], [ 203, 202]])
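The matrix shows noticeably more false negatives (203) than false positives (63), so overall accuracy hides how well actual churners are identified. A per-class precision/recall summary makes this visible (a small sketch, not in the original notebook):

# Sketch: per-class precision, recall and F1 for the same predictions.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))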

Calculating Accuracy

(cm[0][0]+cm[1][1])/np.sum(cm)
0.86699999999999999
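The same figure can be obtained directly from scikit-learn instead of indexing the confusion matrix by hand:

# Equivalent accuracy computed by scikit-learn.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)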

Applying k-Fold Cross Validation

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies # accuracy on each of the 10 folds
array([ 0.87640449, 0.8639201 , 0.88125 , 0.86625 , 0.86375 , 0.855 , 0.865 , 0.8575 , 0.8485607 , 0.87359199])
np.mean(accuracies) # mean of accuracies
0.86512272851207572
np.std(accuracies) # standard deviation of accuracies
0.0094793902817781814
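A compact way to report the cross-validation result is the mean accuracy together with the standard deviation across folds, for example:

# Summarise the 10-fold results in one line.
print("CV accuracy: {:.4f} (+/- {:.4f})".format(accuracies.mean(), accuracies.std()))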

Applying Grid Search to find the best model and the best parameters (Optional)

from sklearn.model_selection import GridSearchCV
help(XGBClassifier())
Help on XGBClassifier in module xgboost.sklearn object: class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin) | Implementation of the scikit-learn API for XGBoost classification. | | Parameters | ---------- | max_depth : int | Maximum tree depth for base learners. | learning_rate : float | Boosting learning rate (xgb's "eta") | n_estimators : int | Number of boosted trees to fit. | silent : boolean | Whether to print messages while running boosting. | objective : string or callable | Specify the learning task and the corresponding learning objective or | a custom objective function to be used (see note below). | booster: string | Specify which booster to use: gbtree, gblinear or dart. | nthread : int | Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs) | n_jobs : int | Number of parallel threads used to run xgboost. (replaces nthread) | gamma : float | Minimum loss reduction required to make a further partition on a leaf node of the tree. | min_child_weight : int | Minimum sum of instance weight(hessian) needed in a child. | max_delta_step : int | Maximum delta step we allow each tree's weight estimation to be. | subsample : float | Subsample ratio of the training instance. | colsample_bytree : float | Subsample ratio of columns when constructing each tree. | colsample_bylevel : float | Subsample ratio of columns for each split, in each level. | reg_alpha : float (xgb's alpha) | L1 regularization term on weights | reg_lambda : float (xgb's lambda) | L2 regularization term on weights | scale_pos_weight : float | Balancing of positive and negative weights. | base_score: | The initial prediction score of all instances, global bias. | seed : int | Random number seed. (Deprecated, please use random_state) | random_state : int | Random number seed. (replaces seed) | missing : float, optional | Value in the data which needs to be present as a missing value. If | None, defaults to np.nan. | **kwargs : dict, optional | Keyword arguments for XGBoost Booster object. Full documentation of parameters can | be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. | Attempting to set a parameter via the constructor args and **kwargs dict simultaneously | will result in a TypeError. | Note: | **kwargs is unsupported by Sklearn. We do not guarantee that parameters passed via | this argument will interact properly with Sklearn. | | Note | ---- | A custom objective function can be provided for the ``objective`` | parameter. In this case, it should have the signature | ``objective(y_true, y_pred) -> grad, hess``: | | y_true: array_like of shape [n_samples] | The target values | y_pred: array_like of shape [n_samples] | The predicted values | | grad: array_like of shape [n_samples] | The value of the gradient for each sample point. | hess: array_like of shape [n_samples] | The value of the second derivative for each sample point | | Method resolution order: | XGBClassifier | XGBModel | sklearn.base.BaseEstimator | sklearn.base.ClassifierMixin | builtins.object | | Methods defined here: | | __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs) | Initialize self. See help(type(self)) for accurate signature. 
| | evals_result(self) | Return the evaluation results. | | If eval_set is passed to the `fit` function, you can call evals_result() to | get evaluation results for all passed eval_sets. When eval_metric is also | passed to the `fit` function, the evals_result will contain the eval_metrics | passed to the `fit` function | | Returns | ------- | evals_result : dictionary | | Example | ------- | param_dist = {'objective':'binary:logistic', 'n_estimators':2} | | clf = xgb.XGBClassifier(**param_dist) | | clf.fit(X_train, y_train, | eval_set=[(X_train, y_train), (X_test, y_test)], | eval_metric='logloss', | verbose=True) | | evals_result = clf.evals_result() | | The variable evals_result will contain: | {'validation_0': {'logloss': ['0.604835', '0.531479']}, | 'validation_1': {'logloss': ['0.41965', '0.17686']}} | | fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None) | Fit gradient boosting classifier | | Parameters | ---------- | X : array_like | Feature matrix | y : array_like | Labels | sample_weight : array_like | Weight for each instance | eval_set : list, optional | A list of (X, y) pairs to use as a validation set for | early-stopping | eval_metric : str, callable, optional | If a str, should be a built-in evaluation metric to use. See | doc/parameter.md. If callable, a custom evaluation metric. The call | signature is func(y_predicted, y_true) where y_true will be a | DMatrix object such that you may need to call the get_label | method. It must return a str, value pair where the str is a name | for the evaluation and value is the value of the evaluation | function. This objective is always minimized. | early_stopping_rounds : int, optional | Activates early stopping. Validation error needs to decrease at | least every <early_stopping_rounds> round(s) to continue training. | Requires at least one item in evals. If there's more than one, | will use the last. Returns the model from the last iteration | (not the best one). If early stopping occurs, the model will | have three additional fields: bst.best_score, bst.best_iteration | and bst.best_ntree_limit. | (Use bst.best_ntree_limit to get the correct value if num_parallel_tree | and/or num_class appears in the parameters) | verbose : bool | If `verbose` and an evaluation set is used, writes the evaluation | metric measured on the validation set to stderr. | xgb_model : str | file name of stored xgb model or 'Booster' instance Xgb model to be | loaded before training (allows training continuation). | | predict(self, data, output_margin=False, ntree_limit=0) | | predict_proba(self, data, output_margin=False, ntree_limit=0) | | ---------------------------------------------------------------------- | Methods inherited from XGBModel: | | __setstate__(self, state) | | apply(self, X, ntree_limit=0) | Return the predicted leaf every tree for each sample. | | Parameters | ---------- | X : array_like, shape=[n_samples, n_features] | Input features matrix. | | ntree_limit : int | Limit number of trees in the prediction; defaults to 0 (use all trees). | | Returns | ------- | X_leaves : array_like, shape=[n_samples, n_trees] | For each datapoint x in X and for each tree, return the index of the | leaf x ends up in. Leaves are numbered within | ``[0; 2**(self.max_depth+1))``, possibly with gaps in the numbering. | | get_booster(self) | Get the underlying xgboost Booster of this model. 
| | This will raise an exception when fit was not called | | Returns | ------- | booster : a xgboost booster of underlying model | | get_params(self, deep=False) | Get parameters. | | get_xgb_params(self) | Get xgboost type parameters. | | ---------------------------------------------------------------------- | Data descriptors inherited from XGBModel: | | feature_importances_ | Returns | ------- | feature_importances_ : array of shape = [n_features] | | ---------------------------------------------------------------------- | Methods inherited from sklearn.base.BaseEstimator: | | __getstate__(self) | | __repr__(self) | Return repr(self). | | set_params(self, **params) | Set the parameters of this estimator. | | The method works on simple estimators as well as on nested objects | (such as pipelines). The latter have parameters of the form | ``<component>__<parameter>`` so that it's possible to update each | component of a nested object. | | Returns | ------- | self | | ---------------------------------------------------------------------- | Data descriptors inherited from sklearn.base.BaseEstimator: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) | | ---------------------------------------------------------------------- | Methods inherited from sklearn.base.ClassifierMixin: | | score(self, X, y, sample_weight=None) | Returns the mean accuracy on the given test data and labels. | | In multi-label classification, this is the subset accuracy | which is a harsh metric since you require for each sample that | each label set be correctly predicted. | | Parameters | ---------- | X : array-like, shape = (n_samples, n_features) | Test samples. | | y : array-like, shape = (n_samples) or (n_samples, n_outputs) | True labels for X. | | sample_weight : array-like, shape = [n_samples], optional | Sample weights. | | Returns | ------- | score : float | Mean accuracy of self.predict(X) wrt. y.
# Tried various parameters; this combination is the best so far.
parameters = [{'max_depth': [3],
               'learning_rate': [0.1],
               'n_estimators': [250],
               'booster': ['gbtree', 'gblinear', 'dart']}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_accuracy
0.86550000000000005
best_parameters = grid_search.best_params_
best_parameters
{'booster': 'gbtree', 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 250}
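Because `refit=True` is the GridSearchCV default, the best parameter combination has already been retrained on the whole training set and is available as `best_estimator_`. A small sketch evaluating it on the held-out test set (not part of the original notebook):

# Sketch: evaluate the refitted best model on the test set.
from sklearn.metrics import accuracy_score
best_classifier = grid_search.best_estimator_
y_pred_best = best_classifier.predict(X_test)
accuracy_score(y_test, y_pred_best)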